[...] estimating the object’s pose, making it inaccurate for mobile augmented reality (MAR)
applications. Objects augmented in the current system have much jitter due to frame illumination
changes, affecting the accuracy of vision-based pose estimation. This paper proposes to estimate the
pose of an [...] mechanical system (MEMS) sensor (gyroscope) to minimize the jitter problem in MAR.
The algorithm used for feature detection and description is [...] for pose estimation, random sample
consensus (RANSAC) is used. Furthermore, gyroscope sensor data is incorporated with the vision-based
pose estimation. We evaluated the performance of augmenting the 3D object [...] method was superior
to the existing vision-based pose estimation algorithms.

Keywords: Augmented reality, Gyroscope sensor, Pose estimation

1. INTRODUCTION 

Augmented reality (AR) is a technology that evolved from virtual reality (VR). VR revolves 
around a computer-generated environment, whereas AR combines the real world with 
computer-generated information. Azuma et al. [1] have described AR in a novel way that includes real-time 
object augmentation. In the real world, virtual and physical items must be mathematically aligned. The 
principle of AR is to integrate virtual information such as 3D models, images, text, video, and music 
into the real environment, which further enhances the real world [2]. In recent years, AR applications 
have become increasingly ubiquitous. AR technology is applied in a wide range of fields such as tourism, 
medicine, logistics, entertainment, and maintenance [3]-[5]. It plays a huge role in tourism, 
including transportation, food, accommodation, and museums. Businesses that use AR applications are 
anticipated to gain an advantage in making progress and capturing the market [6]. 

3D pose estimation is a long-standing problem in computer vision. It is the process of estimating 
an object’s orientation or relative position with respect to a defined reference. Many 
applications take advantage of pose estimation in recognition, localization, object grasping, and mapping. 
The extraction of features (key points) from a scene or object is the first step in estimating a pose; the 
features are then used to determine the relative pose of the object with respect to the coordinate frame. 
However, because the camera projection contains only 2D data from the 3D world, the pose of the scene or 
objects can be recovered only to a limited accuracy. 

AR image registration overlays the virtual object onto the real environment, allowing 3D objects 
to be augmented using different computer vision techniques. To do so, 3D registration 
technology first determines the relationship between the virtual object and the orientation of the display 
device. In addition, the rendered object must be precisely augmented into the real scene to merge the 
virtual image and the model with the real environment [2], [7]. 

Vision-based pose estimation in AR applications has been widely explored. However, mobile 
augmented reality (MAR) applications still have certain issues, such as jitter and sensitivity to 
illumination. This paper aims to estimate the pose of an object by combining vision-based techniques with a 
micro-electro-mechanical system (MEMS) sensor (gyroscope) to minimize the jitter issue in the pose 
estimation. First, the video is recorded and then processed using the software. Next, the reference in the 
video is tracked. The reference in this scenario is our target in the video, on which we intend to augment 
the virtual object. Many studies have addressed AR image registration, which allows rendering the 
virtual object into the physical or real environment. However, when there is a change in orientation or 
illumination, the pose is affected, disturbing the augmentation. To check the robustness of our proposed 
algorithm against this, we recorded a video comprising a series of movements such as rotating, tilting, 
and different random motions. Secondly, oriented FAST rotated BRIEF (ORB) and random sample 
consensus (RANSAC) are used to perform feature extraction and homography estimation, respectively. 
Finally, the augmentation is initially done using the vision-based method only, and sensor data 
(gyroscope) is incorporated only when the number of extracted features falls below the threshold. 
Incorporating the gyroscopic sensor data significantly improves pose estimation and the augmentation 
of the 3D object. 

The rest of the paper is organized as follows: section 2 presents the related works, and section 
3 presents the proposed method to incorporate the sensor data with vision-based pose estimation and augment 
the 3D object on a target surface. Section 3 further describes the feature detection and matching process, 
homography estimation, and the gyroscope. Finally, section 4 shows the experimental results and discussion, 
followed by the conclusion in section 5. 


2. RELATED WORKS 

AR is the augmentation of virtual objects, whether 2D or 3D, on top of the physical layer. The digital 
information (2D/3D objects) augmented in the real environment has immensely advanced visualization 
technology. AR applications capture images of the physical world and represent them with additional layers 
of data, which are then displayed on different digital screens [8]. Enhancing visual performance using AR 
technology has become a trend for big brands to capture the market. The approach in [9] presents an 
AR framework that uses AR and shader effects. Shaders are scripts that include the mathematical 
computations and techniques needed to compute the color of each rendered pixel. The animated screen 
effects and light effects are achieved by shading. Furthermore, the techniques used in the framework make 
the AR scene more appealing and realistic by overlaying different virtual objects and make it more 
convenient for users to experience the 3D augmentation on their mobile devices. However, the system cannot 
estimate the pose of the objects accurately. 

Accomplishing the goal of inserting the virtual objects in an image sequence accurately where the 
3D objects are rendered and aligned with the real environment is of great importance. Although AR has 
seamlessly allowed augmenting the 3D objects, the pose estimation or camera localization process issues still 
exist. In recent years, Marchand et al. [10] have presented a brief introduction to the different approaches 
related to vision-based pose estimation and camera localization issues. Additionally, the gap between 
practical implementation and theoretical aspects of pose estimation has been reduced. Eyjolfsdottir and 
Turk [11] proposed a multisensory method for estimating the transformation of a mobile phone between 
images taken from its camera. The method used inertial sensors to support vision-based pose estimation by 
warping two images into the same perspective. Adaptive features from accelerated segment test (FAST) 
feature detectors and image patches are used for key point detection. The results show a considerable 
improvement in matching the key points between two images. However, the method fails when there is a large 
transformation, due to poor estimation of linear movement. 

With MEMS sensors becoming more accurate, the camera pose estimation problem has become less severe. 
An approach to estimating the 3D camera rotation and translation, which are the extrinsic parameters of 
camera pose estimation, was made by [12], in which vision data and inertial data are fused using a 
simplified structure from motion for pose estimation. A gyroscope sensor was used to estimate the camera 
rotation parameter, while the translation parameter was estimated separately. Moreover, the camera rotation 
parameter assists in estimating the translation parameter using image data. Nevertheless, the technique is 
limited by the drift of gyroscopes, which need to be recalibrated after long use. In addition, a pose 
estimation algorithm using depth information was proposed by [13]. The results of the latter study were more 
accurate than those of algorithms using both depth information and color information when evaluated against 
many scenes. 

Another study [14] proposed using integrated position sensors, universal serial bus 
(USB) position sensors, a head-mounted display, and a gyroscopic mouse in the City 3D AR project. The 
research is intended for use in the field of architecture and urban planning. However, the project faces 
challenges in augmenting the 3D objects in the outdoor environment. 


3. PROPOSED METHOD 

This section describes the proposed solution for estimating the pose of the object and its reference using 
vision-based pose estimation along with gyroscope sensor data to precisely augment the 3D virtual object in 
the real environment. The virtual object to be augmented must remain fixed on the target surface without any 
jitter, so the position and orientation of the 3D virtual object must match the position and orientation of 
the predefined target surface. Moreover, if the surface on which the 3D object is augmented changes its 
position and orientation, the 3D object must change accordingly. 

To achieve this objective, we incorporate the sensor data with vision-based pose estimation to reduce the 
jitter problem while augmenting the 3D object on the target surface. Figure 1 shows the procedure, which 
begins with video data acquisition, followed by feature extraction of the target surface using the ORB 
algorithm and homography estimation with RANSAC for pose estimation of the target surface. After 
estimating the target’s pose, the method proceeds to the augmentation based on vision data alone. 
However, if the number of key points falls below the set threshold because of a significant change in the 
target’s orientation, we combine the vision data with sensor data that defines the target position in three 
dimensions. Thus, the vision and sensor data are aligned together to enhance the pose accuracy and improve 
the augmentation. 

Figure 1. Overview of the stages associated with vision and sensor-based pose estimation 


3.1. Video data acquisition 

A mobile phone is used to capture video data of the reference on which the virtual item is to be 
augmented. Even though the study aims to use real-time video input, a pre-recorded video is used to evaluate 
several algorithms. This is because, with real-time video, the locations of the key points change from one 
session to the next, so the outcomes are not consistent across executions of the real-time test. With a 
pre-recorded video, however, each key point is at precisely the same location in every test iteration, 
resulting in reproducible findings. The resolution and other details of the video are given in the 
experimental section. 

3.2. Target surface recognition 

Target detection/recognition begins after the video acquisition. The video consists of many 
frames, and the target surface on which we augment the 3D object changes from frame to frame. So, to 
augment the 3D virtual object precisely, we must first detect the surface for the virtual object. 
The target recognition consists of several steps, which are discussed in the following 
subsections: 


3.2.1. Feature extraction 

In computer vision and other related image processing applications, two critical tasks are feature 
detection and matching. Image processing applications are expanding in multiple fields daily. Image 
matching algorithms are used in a wide range of applications, from simple photogrammetric tasks like feature 
detection to the development of sophisticated 3D modeling tools and image search engines. The feature of an 
image is the valuable information that helps solve computational problems in computer vision 
applications. Feature extraction is the process of retrieving valuable information (interest points) from 
the image while reducing the amount of information needed to represent a large amount of data. Moreover, 
image retrieval systems are mostly based on color, shape, texture, and layout [15], [16]. Therefore, the 
image features should be distinctive, such as corners and edges. They should also be scale-invariant or 
invariant to other transformations. 


3.2.2. Feature description 

Descriptors provide the representation of the information acquired from the feature and 
surroundings. Descriptors encapsulate the feature vector of the object to be recognized, and the feature vector 
contains the descriptors of the interest points in the reference and target image. Many algorithms exist 
to extract the features from the image and compute their descriptors, such as speeded up robust features 
(SURF), scale-invariant feature transform (SIFT), ORB, and many more. The algorithm used in our project is 
ORB, as it provides good performance with a minimal computational load. ORB is an efficient feature 
detector and descriptor ideal for real-time situations where efficiency and speed are preferred [17], [18]. 
Also, it is free from any patent protection claims. Both its detection and description stages perform 
well at low computational cost. 


3.2.3. Feature matching 

The task of determining the correspondence between the reference and the target image is known as 
feature matching. The easiest method of doing this is to take the descriptor of each element in the first 
set, compute the distance to all the descriptors in the second set, and return the nearest one as the best 
match. However, the matching task depends on the variations within an image and the type of image to be 
matched. Certain factors need to be examined while matching images: i) scale: at least two views in the 
set of images differ in scale, ii) orientation: the views of the images are rotated with respect to one 
another, iii) affine transformation: whether the object is flat, angular, or textured, iv) illumination: 
variation in illumination is also a typical obstacle to efficient feature matching, and v) occlusion: two 
spatially separated objects in the 3D environment might look like one or interfere with each other in the 
2D image plane [19]. Moreover, a threshold should be defined on the number of matches found; it sets the 
minimum number of key points that must match the reference for efficient recognition. 
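
The nearest-descriptor search and the match-count threshold described above can be illustrated with a minimal sketch. It assumes binary descriptors compared by Hamming distance, as is usual for ORB; the function names, the toy 8-bit descriptors, and the values of `max_distance` and `min_matches` are illustrative assumptions and are not taken from the experiments.

```python
def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(ref, target, max_distance=2):
    """Nearest-neighbour matching: for every reference descriptor, find the
    closest target descriptor and keep the pair if it is close enough."""
    matches = []
    if not target:
        return matches
    for i, d_ref in enumerate(ref):
        distances = [hamming(d_ref, d_tgt) for d_tgt in target]
        j = min(range(len(target)), key=distances.__getitem__)
        if distances[j] <= max_distance:
            matches.append((i, j, distances[j]))
    return matches


def is_recognized(matches, min_matches=3):
    """Apply the threshold on the number of matches (assumed value)."""
    return len(matches) >= min_matches


# Toy 8-bit descriptors (real ORB descriptors are 256 bits long).
ref = [0b10110010, 0b01001101, 0b11110000, 0b00001111]
target = [0b11110001, 0b10110011, 0b00001111, 0b01101101]
matches = match_descriptors(ref, target)
print(matches, is_recognized(matches))
```

Each tuple holds the reference index, the matched target index, and their distance. The exhaustive search is quadratic in the number of descriptors, which is acceptable for the few hundred key points typically kept per frame.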


3.3. Homography estimation 

Once the reference surface is recognized in the current frame with a sufficient number of valid matches, we 
can proceed to estimate the homography between the two images. First, we need to find the 
transformation that maps points from the image plane to the surface plane. Then, the homography matrix 
is used while estimating the pose of the given image. The coordinates P0, P1, P2, and P3 shift an 
image from the viewer’s perspective on an image plane and project it onto a world plane in a 
three-dimensional environment, as shown in Figure 2 [20]. 

The transformation needs to be updated in every frame we process. The homography is 
expected to describe a mapping from one image plane to the next that corresponds to a rigid body 
transformation. Hence, the rigid body is assumed to keep its shape during the acquisition of the images, 
and the transformation occurs only on the projected image surface when the camera view changes [21]. 
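
As an illustration of how a homography maps the four corner coordinates from one plane to another, the following minimal sketch applies a 3x3 matrix to 2D points in homogeneous coordinates. The matrix `H` and the corner values are hypothetical and chosen only to show the perspective division; they are not estimated from the recorded video.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Dividing by w converts homogeneous coordinates back to pixel coordinates.
    return (u / w, v / w)


# Hypothetical homography: a translation plus a mild perspective term.
H = [
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 5.0],
    [0.001, 0.0, 1.0],
]

# Corners P0..P3 of a 100x100 reference surface (assumed size).
P0, P1, P2, P3 = (0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)
projected = [apply_homography(H, p) for p in (P0, P1, P2, P3)]
print(projected)
```

Because the last row of `H` is not (0, 0, 1), the two right-hand corners are scaled differently from the left-hand ones, which is the foreshortening effect that a pure affine transformation cannot express.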

Once the matches between the reference and target image are found, an existing algorithm can be used to 
determine the homography. The RANSAC algorithm can simultaneously sort out the outliers based on the 
estimated model. It is an iterative technique for estimating the parameters of a mathematical model from 
sample data containing both outliers and inliers, and it works well even in the presence of a large number 
of outliers. Determining an accurate homography is nevertheless somewhat complicated, as the model has to 
be set so that the number of outliers is minimal. With fewer outliers, the 3D object is augmented more 
accurately. 
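
The hypothesize-and-verify loop of RANSAC can be sketched in a few lines. To keep the example free of external libraries, the sketch below fits a 2D similarity transform (rotation, scale, and translation, written as z -> a*z + b on complex numbers) instead of a full homography; this substitution, the function names, the tolerance, and the sample points are all assumptions made for illustration. A homography would need four correspondences per sample rather than two.

```python
import random


def fit_similarity(p1, p2, q1, q2):
    """Solve z -> a*z + b so that p1 -> q1 and p2 -> q2 (points as complex)."""
    a = (q2 - q1) / (p2 - p1)
    b = q1 - a * p1
    return a, b


def ransac_similarity(src, dst, iterations=200, tol=1.0, seed=0):
    """Return the model with the largest consensus set and its inlier indices."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    n = len(src)
    for _ in range(iterations):
        i, j = rng.sample(range(n), 2)          # minimal sample: two matches
        if src[i] == src[j]:
            continue                            # degenerate sample
        a, b = fit_similarity(src[i], src[j], dst[i], dst[j])
        inliers = [k for k in range(n) if abs(a * src[k] + b - dst[k]) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers


# Synthetic matches: a 90-degree rotation plus a shift, with two gross outliers.
src = [0 + 0j, 10 + 0j, 10 + 10j, 0 + 10j, 5 + 5j, 3 + 8j, 7 + 2j, 9 + 6j]
true_a, true_b = 1j, 5 - 2j
dst = [true_a * z + true_b for z in src]
dst[2] += 40 + 25j      # wrong match
dst[5] += -30 + 15j     # wrong match

model, inliers = ransac_similarity(src, dst)
print(model, inliers)
```

The two corrupted matches are excluded from the consensus set, so the recovered model is determined only by the correct correspondences.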


Pose estimation algorithm for mobile augmented reality based on inertial sensor fusion (Mir Suhail Alam) 


ISSN: 2088-8708 


Figure 2. Homography evaluation for pose estimation 


3.4. Pose estimation in augmented reality 

AR is essentially a multidisciplinary and long-established field. The problem of registering the real 
world with the virtual world has long attracted interest, but from a wider perspective, it is a 
motion tracking problem. Different sensors have been considered, such as magnetic, mechanical, inertial, 
global positioning system (GPS), and ultrasonic devices, but there has been no silver bullet 
for this issue [22]. The method of estimating the camera’s position and orientation from a collection 
of correspondences between 3D features and their image plane projections is called pose estimation. Any 
error in the estimated camera pose in the global frame would be obvious to the user. 
Vision-based AR therefore reduces to a camera pose estimation problem. A pose can be represented by three 
rotation angles and three translation components, and at least three point correspondences are needed to 
estimate a 3D pose. Researchers have tried various techniques, including the perspective-n-point (PnP) 
problem and simultaneous localization and mapping (SLAM), to estimate the pose from the available data, 
either 3D or 2D. Although the P3P technique solves the pose estimation problem with three points, 
increasing the number of points produces more reliable results. 


3.5. Inertial sensors 

Gyroscopes are devices used to measure the angular rate of rotation. MEMS 
gyroscopes and magnetometers currently provide this function in various packages that are generally 
integrated into a sensor module or chip and are broadly used in a variety of applications [23]. A MEMS 
gyroscope uses a minuscule micromechanical structure on silicon that converts motion into an electrical 
signal. It is mostly used in navigation systems to deliver the heading estimate. 

The gyroscope measures the angular velocity about three axes. As a result, it does not measure roll, 
pitch, or yaw directly. However, integrating the angular velocity over time produces the angle, which can 
then be used to determine changes in roll, pitch, and yaw, even though gyroscope readings are sometimes 
erroneous due to fast motion. In general, we calculate roll and pitch using (1). 


$$\phi = \arctan\left(\frac{-a_x}{a_y}\right), \qquad \theta = \arctan\left(\frac{a_z}{a_y}\right) \tag{1}$$

Equations (2)-(4) are used to calculate the orientation and to rotate the acceleration vector [11]. 

$$\begin{aligned}
\dot{\theta} &= \omega_y \cos\phi - \omega_x \sin\phi \\
\dot{\phi} &= \omega_z + (\omega_y \sin\phi + \omega_x \cos\phi)\tan\theta \\
\dot{\psi} &= (\omega_y \sin\phi + \omega_x \cos\phi)/\cos\theta
\end{aligned} \tag{2}$$

$$C = \begin{bmatrix} \cos\psi\cos\theta & \sin\psi\cos\theta & -\sin\theta \\ \vdots & \vdots & \vdots \end{bmatrix}$$

where $C$ is the rotation matrix used to rotate the local acceleration vector, and $\theta$, $\phi$, and 
$\psi$ denote pitch, roll, and yaw, respectively. In the MAR application, gyroscopes can assist in 
recording the changes in the target’s orientation so that the pose is estimated accurately and the 3D 
object is precisely augmented on the target surface. So, to enhance the pose estimation accuracy, the 
gyroscopic sensor can be incorporated with the vision-based method. 
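
The angular-rate relations in (2) can be integrated numerically to track roll, pitch, and yaw from gyroscope samples. The sketch below uses a simple forward Euler step; the function names, the sample values, and the step size are assumptions for illustration, and the axis convention (roll driven by the z-axis rate) follows the form of (2).

```python
import math


def euler_rates(phi, theta, wx, wy, wz):
    """Angle rates (roll, pitch, yaw) from body angular velocities, as in (2)."""
    mixed = wy * math.sin(phi) + wx * math.cos(phi)
    theta_dot = wy * math.cos(phi) - wx * math.sin(phi)
    phi_dot = wz + mixed * math.tan(theta)
    psi_dot = mixed / math.cos(theta)   # singular when pitch reaches +/- 90 degrees
    return phi_dot, theta_dot, psi_dot


def integrate_gyro(samples, dt, phi=0.0, theta=0.0, psi=0.0):
    """Forward Euler integration of (wx, wy, wz) samples taken every dt seconds."""
    for wx, wy, wz in samples:
        phi_dot, theta_dot, psi_dot = euler_rates(phi, theta, wx, wy, wz)
        phi += phi_dot * dt
        theta += theta_dot * dt
        psi += psi_dot * dt
    return phi, theta, psi


# Hypothetical recording: 1 rad/s about the z axis for one second at 100 Hz.
samples = [(0.0, 0.0, 1.0)] * 100
roll, pitch, yaw = integrate_gyro(samples, dt=0.01)
print(roll, pitch, yaw)
```

Because each step adds the measured rate times the step size, any constant bias in the readings accumulates linearly in the angles, which is the drift limitation noted for gyroscopes.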


4. EXPERIMENTAL RESULTS AND DISCUSSION 

This section presents the experiments and the results obtained in estimating the pose and augmenting 
the 3D object in a real environment. The steps of the proposed algorithm are video data 
acquisition, key point detection and description, homography estimation, and pose estimation, followed by 
incorporation of the gyroscope sensor data to improve the augmentation, which depends on estimating the 
pose accurately. The tools used during the experimentation are matrix laboratory 
(MATLAB) and Python with OpenCV (a popular computer vision library). Moreover, the gyroscope sensor 
readings were recorded using the Android application Physics Toolbox Sensor Suite. 

The video is recorded using an Android smartphone, the Xiaomi Mi A1, which has a 20-megapixel camera. 
The resolution of the video is 1080x1920 pixels, and its duration is 34 seconds at a rate of 30 frames 
per second, which results in 1020 frames. We experimented with many videos of different lengths but 
eventually selected the 34-second video for the results. In addition, the video comprises a series of 
camera movements such as tilting, rotating, and random motions to show the system’s robustness. 


4.1. Quantitative analysis 
4.1.1. Ground truth analysis 

The key points that appear in all the video frames are the ground truth of our method. We used MATLAB 
to process the video and extract all the frames. After extracting the frames, we used the ORB algorithm 
to obtain the key points of every frame. The ORB algorithm is a hybrid of a modified FAST detector and a 
modified binary robust independent elementary features (BRIEF) descriptor. FAST detects key points by 
comparing a candidate pixel p with its neighboring pixels within the radius r. The pixel p is detected as 
a key point only if the intensity of the surrounding pixels within the radius r differs significantly from 
that of p. Figures 3(a) and 3(b) depict the key points (green dots) of our target surface and the 5 
strongest key points in each frame of the recorded video, respectively. 
Since each frame extracted from the video is only 2D, the position of the key points is 
denoted only in x and y coordinates. Figure 4 shows the ground truth, which encapsulates all the key 
points in the frames. 
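
The intensity test that FAST applies around a candidate pixel can be sketched as follows. This is a simplified version of the standard segment test on a 16-pixel circle of radius 3, written in plain Python on a nested-list image; the threshold `t`, the arc length `n`, and the toy images are assumed values, and the modifications that ORB makes to FAST (such as orientation and scoring) are not included.

```python
# Offsets (dx, dy) of the 16 pixels on a circle of radius 3 around the candidate.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, t=20, n=9):
    """Segment test: pixel (x, y) is a key point if at least n contiguous
    circle pixels are all brighter, or all darker, than it by more than t."""
    p = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):                      # +1: brighter arc, -1: darker arc
        flags = [sign * (v - p) > t for v in ring]
        run = 0
        for flag in flags + flags:            # doubled list handles wrap-around
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False


# Toy 7x7 images: an isolated bright spot, a flat patch, and a straight edge.
spot = [[0] * 7 for _ in range(7)]
spot[3][3] = 255
flat = [[50] * 7 for _ in range(7)]
edge = [[100 if col >= 3 else 0 for col in range(7)] for _ in range(7)]

print(is_fast_corner(spot, 3, 3), is_fast_corner(flat, 3, 3), is_fast_corner(edge, 3, 3))
```

The straight edge is rejected because only 7 contiguous circle pixels differ from the center, fewer than the 9 required, which is why the test responds to corner-like structures rather than to edges.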

Figure 3. The key points (a) key points of the target and (b) 5 strong key points in each frame 

Figure 4. Ground truth of all video frames 


4.1.2. Vision-based analysis 

The vision-based approach is widely used for estimating the pose of a mobile device or a static object. 
The visual input can be a video or an image used to interpret the environment. It represents a 
spatial connection between the captured 2D image and the 3D points in the scene. The use of AR markers 
makes the alignment of the virtual object with the real environment very efficient. Moreover, it improves 
stability and reduces the computation required. Other vision-based 
approaches such as [24], [25] have used enhanced techniques. However, the efficiency and robustness 
of a vision-based method depend on the performance of the feature extractor, i.e., inadequate feature 
extraction in images results in failure of the pose estimation. So, robust pose estimation based on vision 
alone is yet to be achieved. The incorporation of sensors such as gyroscopes and accelerometers with 
vision-based techniques has reduced this issue. 

The recognition of objects and feature matching in AR is vital under uncontrolled, real-world 
conditions. It is important for creating object-based environmental representations and for manipulating 
objects. In our work, object recognition means recognizing a particular object or reference (e.g., an ID 
card). Numerous approaches are presently used to detect, recognize, and classify objects with 
scale-invariant descriptors and detectors. The available algorithms include ORB, SIFT [26], SURF [27], 
difference of Gaussians (DoG), and many more. Object detection and recognition can be done by computer 
vision, which detects an object in an image or video. The recognized object is used to identify an object 
position or a scene [28], [29]. Object recognition can be appearance-based, feature-based, or color-based. 
However, each algorithm has its advantages and disadvantages. 

The algorithm used in our experiment is ORB, as it is an efficient feature detector and descriptor 
suitable for real-time situations that favor efficiency and speed. In the vision-based system, the object’s 
points of interest in the image are matched with the reference in a similar scene or image. After 
extracting the features, we have to estimate the homography between the frames. RANSAC [30] is used here 
because it can estimate parameters with high precision even if the data include a significant number of 
outliers. Figures 5(a) and 5(b) show the feature matching of the frame and homography estimation using 
RANSAC, respectively. 

Figure 6 shows the number of matching key points for each frame. The vision-based method has good 
matching points until illumination changes and tilt occur in our reference. We have set a threshold for 
the number of key points. As long as the key points are above the threshold, our vision-based augmentation 
works well. When the matching key points are reduced due to tilt or illumination changes in the target, 
the pose cannot be estimated well, affecting the augmentation. Frames 400 to 700 have interest points 
below the set threshold because of the target’s random motions and orientation changes. This decrease in 
key points affects pose estimation and augmentation of the 3D object. To overcome this issue, we 
incorporate gyroscope data from the inertial sensors. Figure 7 shows that vision-based pose 
estimation works well when the key points are above the threshold, which helps in 3D augmentation. However, 
when the key points are below the threshold, we incorporate the sensor data with vision data to enhance 
pose estimation accuracy. 
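
The threshold test on the number of matching key points can be written as a short routine that reports which frame ranges need the sensor data. The sketch below is illustrative only: the function name, the per-frame counts, and the threshold value are hypothetical and do not reproduce the values behind Figures 6 and 7.

```python
def frames_needing_sensor(match_counts, threshold):
    """Return inclusive (start, end) frame-index ranges where the number of
    matching key points is below the threshold."""
    ranges = []
    start = None
    for idx, count in enumerate(match_counts):
        if count < threshold:
            if start is None:
                start = idx                    # a below-threshold run begins
        elif start is not None:
            ranges.append((start, idx - 1))    # the run ended on the previous frame
            start = None
    if start is not None:                      # run continues to the last frame
        ranges.append((start, len(match_counts) - 1))
    return ranges


# Hypothetical matching key points per frame and an assumed threshold of 15.
counts = [40, 38, 12, 9, 11, 35, 41, 8]
print(frames_needing_sensor(counts, threshold=15))
```

Frames outside the reported ranges are handled by the vision-based estimate alone, so the sensor data is consulted only where the vision data is unreliable.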


Figure 5. Feature matching and homography (a) frame feature matching and (b) homography using RANSAC 

Figure 6. Matching key points versus frame number 

Figure 7. Matching key points above and below the threshold 


4.2. Sensor fusion 

Sensor fusion aims to improve pose estimation performance by combining and integrating vision 
data with the sensor data. An inertial measurement unit (IMU) sensor alone cannot provide exact pose 
details; hence, vision is often used. However, vision data alone cannot cope with occlusion or rapid 
movement because of the camera’s limited scope. Thus, fusing IMU (gyroscope) data with vision data 
compensates for the deficiencies of the vision-based method and gives better pose estimates [31]. 

Different approaches based on sensor fusion have been proposed in [32], [33]. Assa and Janabi-Sharifi 
[32] proposed an extended Kalman filter (EKF) based sensor fusion approach for pose estimation in which 
multiple camera measurements were fused using a two-camera configuration. Assa and 
Janabi-Sharifi [33] proposed new techniques of multi-camera sensor fusion based on virtual visual 
servoing for efficient and reliable pose estimation. However, these methods concentrate primarily on the 
fusion for pose estimation and have yet to be implemented in visual guidance applications such as grasping. 
Furthermore, the sensor fusion techniques assume that the target is in the intersecting field of view (FOV) 
of all the cameras [34], [35]. Another main issue with the EKF-based multi-camera method is the increase in 
computational cost. 

In our experimental setup, we mounted the target on a mobile phone and recorded the inertial 
sensor data (gyroscope) using that phone, with the camera placed in front of the target. A gyroscope 
senses the changes in the target’s orientation by measuring the angular rates of rotation. The built-in 
gyroscopes in our handheld devices measure these readings, and our application records them for further 
analysis. This assists in enhancing the accuracy of the estimated pose. 

Figures 8 and 9 show the gyroscope data (x, y, and z axes) independently and combined, 
respectively. In addition, Figure 9 indicates that when there is a big orientation change of our target, 
the key points of our frames are also reduced, which affects the accuracy of the estimated pose and of the 
augmentation. Our proposed algorithm uses this sensor data and the vision data to enhance the accuracy of 
the estimated pose. When the key points fall below the set threshold due to illumination or orientation 
changes, we fuse the sensor data with vision data to improve pose estimation and augmentation accuracy. 
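
One simple way such a switch between vision data and sensor data could be realized is sketched below for a single rotation angle. The fusion rule is an assumption made for illustration and is not the exact rule used in the experiments: while the key points are at or above the threshold, the vision-based angle is used; otherwise, the last fused angle is propagated with the gyroscope rate. All names and numbers are hypothetical.

```python
def fuse_orientation(vision_angles, match_counts, gyro_rates, dt, threshold):
    """Per-frame angle (rad): vision estimate when enough key points match,
    otherwise the previous fused angle advanced by the gyroscope rate."""
    fused = []
    angle = None
    for vis, count, rate in zip(vision_angles, match_counts, gyro_rates):
        if count >= threshold or angle is None:
            angle = vis                  # trust the vision-based estimate
        else:
            angle = angle + rate * dt    # bridge the gap with the gyroscope
        fused.append(angle)
    return fused


# Hypothetical data at 30 frames per second; frames 2 and 3 have too few key
# points, and their vision-based angles jump erratically (jitter).
vision = [0.00, 0.10, 0.90, -0.40, 0.40]
counts = [40, 38, 9, 8, 36]
gyro = [3.0, 3.0, 3.0, 3.0, 3.0]         # rad/s about one axis
fused = fuse_orientation(vision, counts, gyro, dt=1 / 30, threshold=15)
print(fused)
```

In this toy sequence the fused angle increases smoothly through frames 2 and 3 instead of following the erratic vision values. Since the gyroscope is used only for short gaps and the estimate is reset by vision afterwards, the accumulated drift stays bounded.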


Figure 8. Sensor data of x-axis, y-axis, z-axis 

Figure 9. Combined gyroscopic data 


4.3. Qualitative analysis 

We found a significant improvement in pose estimation performance after applying the proposed 
algorithm, which fuses sensor data (gyroscope) with vision data. Vision data alone gave good results 
when the target’s orientation did not tilt much. However, it could not estimate 
the pose accurately when there was a significant change in the target’s orientation. Furthermore, due to a 
significant change in the target's position, many key points were not extracted, eventually affecting the 
augmentation. 

Figure 10 shows frames 431 to 433, in which there is a significant change in the target, causing jitter in 
the estimated pose and reducing the key points. The performance in these frames therefore degraded even 
though the target is visible and in the frame. To address this, we used the gyroscopic data, which 
measures the angular rate of motion as mentioned earlier. Since the pose of the target could not be 
estimated properly due to a major change in the orientation, we fused the sensor data with vision data and 
observed a significant improvement in the results, as shown in Figure 11. When the key points fall below 
the defined threshold, the sensor data aids the vision data in determining the pose and thereby helps the 
augmentation. Figures 10 and 11 show the same frames; the former is based on vision data alone, and the 
latter is obtained after incorporating the sensor data. 


Figure 10. 3D augmentation frames based on vision data (panels: Frame 431, Frame 432, Frame 433) 

Figure 11. 3D augmentation frames based on vision data and sensor fusion (panels: Frame 431, Frame 432, Frame 433) 


5. CONCLUSION AND FUTURE WORKS 

A unique method of pose estimation based on vision data combined with sensor data (gyroscope), together 
with 3D augmentation, was introduced with low computational cost. Augmentation of the virtual 3D object in 
the real environment is highly significant due to the increased demand for AR in fields such as medicine, 
tourism, and education. This paper presented an algorithm that incorporates sensor data with 
vision-based pose estimation to reduce the jitter caused by illumination or orientation changes in the 
frame sequence while estimating the pose. We used the ORB algorithm for feature detection and 
description because of its minimal computational load and its suitability for real-time situations. The 
RANSAC algorithm, which is robust in separating inliers from outliers, determined the homography between 
the reference and target image. The performance of vision-based pose estimation was compared with and 
without gyroscope sensor data. The experimental results show better performance using vision-based pose 
estimation with sensor fusion. Finally, the virtual 3D object was augmented on the predefined surface whose 
rotation changes continuously, and the issues of orientation change and jitter were minimized. In future 
work, the limitations of the sensors will be addressed, such as the effect of drift on the gyroscope and 
the poor accuracy of magnetometers and accelerometers under fast motion. 